library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ──────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.4 ✓ dplyr 1.0.2
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
package ‘ggplot2’ was built under R version 3.6.2package ‘tibble’ was built under R version 3.6.2package ‘tidyr’ was built under R version 3.6.2package ‘readr’ was built under R version 3.6.2package ‘purrr’ was built under R version 3.6.2package ‘dplyr’ was built under R version 3.6.2── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
house <- read_csv("data/kc_house_data.csv")
── Column specification ─────────────────────────────────────────────────────────
cols(
.default = col_double(),
id = col_character(),
date = col_datetime(format = "")
)
ℹ Use `spec()` for the full column specifications.
house
house_tidy <- house %>%
select(-c(id, date, sqft_living15, sqft_lot15, zipcode)) %>%
mutate(waterfront = as.logical(waterfront)) %>%
mutate(renovated = ifelse(yr_renovated > 0, TRUE, FALSE)) %>%
select(-yr_renovated) %>%
mutate(condition = as_factor(condition)) %>%
mutate(grade = as_factor(grade))
house_tidy
Check for aliased variables using the alias() function (this takes in a formula object and a data set). [Hint - formula price ~ . says ‘price varying with all predictors’, this is a suitable input to alias()]. Remove variables that lead to an alias. Check the ‘Elements of multiple regression’ lesson for a dropdown containing further information on finding aliased variables in a dataset.
alias(price ~., house_tidy)
Model :
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_built + lat + long + renovated
Complete :
(Intercept) bedrooms bathrooms sqft_living sqft_lot floors
sqft_basement 0 0 0 1 0 0
waterfrontTRUE view condition2 condition3 condition4 condition5
sqft_basement 0 0 0 0 0 0
grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11
sqft_basement 0 0 0 0 0 0 0 0 0
grade12 grade13 sqft_above yr_built lat long renovatedTRUE
sqft_basement 0 0 -1 0 0 0 0
house_data <- house_tidy %>%
select(-sqft_living)
alias(price ~., house_data)
Model :
price ~ bedrooms + bathrooms + sqft_lot + floors + waterfront +
view + condition + grade + sqft_above + sqft_basement + yr_built +
lat + long + renovated
library(GGally)
package ‘GGally’ was built under R version 3.6.2Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
houses_tidy_numeric <- house_data %>%
select_if(is.numeric)
houses_tidy_nonnumeric <- house_data %>%
select_if(function(x) !is.numeric(x))
houses_tidy_nonnumeric$price <- house_data$price
ggpairs(houses_tidy_numeric, progress = FALSE)
ggpairs(houses_tidy_nonnumeric, progress = FALSE)
model1 <- lm(price ~ grade, data = house_data)
summary(model1)
Call:
lm(formula = price ~ grade, data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1929615 -135853 -35090 89080 5565658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 142000 254499 0.558 0.576878
grade3 63667 293870 0.217 0.828484
grade4 72381 258849 0.280 0.779767
grade5 106524 255024 0.418 0.676169
grade6 159920 254561 0.628 0.529868
grade7 260590 254513 1.024 0.305904
grade8 400853 254520 1.575 0.115285
grade9 631513 254547 2.481 0.013112 *
grade10 929771 254611 3.652 0.000261 ***
grade11 1354842 254817 5.317 1.07e-07 ***
grade12 2049222 255909 8.008 1.23e-15 ***
grade13 3567615 264106 13.508 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 254500 on 21601 degrees of freedom
Multiple R-squared: 0.5197, Adjusted R-squared: 0.5195
F-statistic: 2125 on 11 and 21601 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model1)
not plotting observations with leverage one:
19453
null_model <- lm(price ~ 1, data = house_data)
grade_model <- lm(price ~ grade, data = house_data)
anova(null_model, grade_model)
Analysis of Variance Table
Model 1: price ~ 1
Model 2: price ~ grade
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21612 2.9129e+15
2 21601 1.3991e+15 11 1.5138e+15 2124.8 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(modelr)
price_remaining_resid <- house_data %>%
add_residuals(model1) %>%
select(-c(price, grade))
houses_resid_numeric <- price_remaining_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- price_remaining_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- price_remaining_resid$resid
houses_resid_nonnumeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
houses_resid_numeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
model2 <- lm(price ~ grade + lat, data = house_data)
summary(model2)
Call:
lm(formula = price ~ grade + lat, data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1863783 -112379 -28181 67454 5533910
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29949769 610666 -49.044 < 2e-16 ***
grade3 187923 276126 0.681 0.496151
grade4 95367 243212 0.392 0.694978
grade5 126354 239618 0.527 0.597980
grade6 158453 239183 0.662 0.507673
grade7 246275 239138 1.030 0.303093
grade8 378635 239144 1.583 0.113369
grade9 603010 239170 2.521 0.011701 *
grade10 891752 239230 3.728 0.000194 ***
grade11 1311124 239425 5.476 4.4e-08 ***
grade12 2007093 240450 8.347 < 2e-16 ***
grade13 3502099 248154 14.113 < 2e-16 ***
lat 633100 11822 53.554 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 239100 on 21600 degrees of freedom
Multiple R-squared: 0.576, Adjusted R-squared: 0.5758
F-statistic: 2445 on 12 and 21600 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model2)
not plotting observations with leverage one:
19453
price_remaining_resid <- house_data%>%
add_residuals(model2) %>%
select(-c(price, grade, lat))
houses_resid_numeric <- price_remaining_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- price_remaining_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- price_remaining_resid$resid
houses_resid_nonnumeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
model3 <- lm(price ~ grade + bathrooms+ renovated, data = house_data)
summary(model3)
Call:
lm(formula = price ~ grade + bathrooms + renovated, data = house_data)
Residuals:
Min 1Q Median 3Q Max
-842037 -150556 -33888 101464 5621208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -980784 12600 -77.84 <2e-16 ***
grade 179392 2072 86.56 <2e-16 ***
bathrooms 65594 3167 20.71 <2e-16 ***
renovatedTRUE 202725 9052 22.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 267400 on 21609 degrees of freedom
Multiple R-squared: 0.4696, Adjusted R-squared: 0.4696
F-statistic: 6378 on 3 and 21609 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model3)
anova(model2, model3)
Analysis of Variance Table
Model 1: price ~ grade + bathrooms
Model 2: price ~ grade + bathrooms + renovated
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21610 1.5808e+15
2 21609 1.5449e+15 1 3.5859e+13 501.57 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
price_remaining_resid <- house_data %>%
add_residuals(model3) %>%
select(-c(price, grade, bathrooms, renovated))
price_remaining_resid %>%
ggpairs(aes( alpha = 0.5))
price_resid <- house_data %>%
add_residuals(model3) %>%
select(-price)
price_resid %>%
ggplot(aes(x = grade, y = resid, color = renovated)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
price_resid %>%
ggplot(aes(x = bathrooms, y = resid, color = renovated)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
model4 <- lm(price ~ grade + bathrooms+ renovated +grade:renovated, data = house_data)
summary(model4)
Call:
lm(formula = price ~ grade + bathrooms + renovated + grade:renovated,
data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1347847 -149614 -31849 100686 5068199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -927359 12704 -73.00 <2e-16 ***
grade 172767 2073 83.36 <2e-16 ***
bathrooms 64302 3133 20.52 <2e-16 ***
renovatedTRUE -1048293 58147 -18.03 <2e-16 ***
grade:renovatedTRUE 161832 7432 21.77 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 264500 on 21608 degrees of freedom
Multiple R-squared: 0.481, Adjusted R-squared: 0.4809
F-statistic: 5007 on 4 and 21608 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model4)
model5 <- lm(price ~ grade + bathrooms+ renovated+ bathrooms:renovated, data = house_data)
summary(model5)
Call:
lm(formula = price ~ grade + bathrooms + renovated + bathrooms:renovated,
data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1117582 -148986 -32358 100389 5170866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -964299 12527 -76.976 <2e-16 ***
grade 180570 2056 87.809 <2e-16 ***
bathrooms 53490 3205 16.690 <2e-16 ***
renovatedTRUE -237909 24896 -9.556 <2e-16 ***
bathrooms:renovatedTRUE 192637 10152 18.976 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 265200 on 21608 degrees of freedom
Multiple R-squared: 0.4783, Adjusted R-squared: 0.4782
F-statistic: 4953 on 4 and 21608 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model5)
library(relaimpo)
Loading required package: MASS
package ‘MASS’ was built under R version 3.6.2
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
Loading required package: boot
Loading required package: survey
package ‘survey’ was built under R version 3.6.2Loading required package: grid
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
Loading required package: survival
Attaching package: ‘survival’
The following object is masked from ‘package:boot’:
aml
Attaching package: ‘survey’
The following object is masked from ‘package:graphics’:
dotchart
Loading required package: mitools
This is the global version of package relaimpo.
If you are a non-US user, a version with the interesting additional metric pmvd is available
from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
calc.relimp(model4, type = "lmg", rela = TRUE)
Response variable: price
Total response variance: 134782378397
Analysis based on 21613 observations
4 Regressors:
grade bathrooms renovatedTRUE grade:renovatedTRUE
Proportion of variance explained by model: 48.1%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
grade 0.72038621
bathrooms 0.22725691
renovatedTRUE 0.02846745
grade:renovatedTRUE 0.02388942
Average coefficients for different model sizes:
1X 2Xs 3Xs 4Xs
grade 208457.6 193065.4 190011.73 172767.2
bathrooms 250326.5 158707.2 65594.01 64302.5
renovatedTRUE 230018.2 197668.2 -428907.28 -1048292.9
grade:renovatedTRUE NaN NaN 164719.43 161832.1
houses_resid %>%
ggplot(aes(x = sqft_basement, y = resid, colour = grade)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ grade)